{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 概述\n",
    "\n",
    "[Baidu ElasticSearch](https://cloud.baidu.com/doc/BES/index.html?from=productToDoc)（以下简称 BES）是一款100%兼容开源的分布式检索分析服务。为结构化/非结构化数据提供低成本、高性能及可靠性的检索、分析平台级产品服务。向量能力方面，支持多种索引类型和相似度距离算法。\n",
    "目前，BES已经和文心一言大模型联合打造方案，也广泛应用于推荐、计算机视觉、智能问答等领域；同时，BES的向量检索和存储的能力也集成进入Langchain。\n",
    "本文主要介绍基于Langchain的框架，结合BES的向量数据库的能力，对接千帆平台的模型管理和应用接入的能力，从而构建一个RAG的知识问答场景。"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 准备工作\n",
    "用户需要保证python的版本大于等于3.9，且需要安装langchain，qianfan sdk的包。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install langchain==0.332\n",
    "!pip install elasticsearch==7.11.0\n",
    "!pip install qianfan \n",
    "!pip install pdfplumber"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 系统设置\n",
    "1. 需要在千帆大模型平台创建应用，获得接入的ak、sk\n",
    "2. 在BES的产品界面上创建一个BES集群，详见(https://cloud.baidu.com/doc/BES/s/0jwvyk4tv)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ['QIANFAN_AK'] = \"your_qianfan_ak\"\n",
    "os.environ['QIANFAN_SK'] = \"your_qianfan_sk\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 开发过程\n",
    "### 文档加载、切分"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from langchain.document_loaders import PDFPlumberLoader\n",
    "\n",
    "\n",
    "loader = PDFPlumberLoader(\"./example_data/ai-paper.pdf\")\n",
    "documents = loader.load()\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size = 384, chunk_overlap = 0, separators=[\"\\n\\n\", \"\\n\", \" \", \"\", \"。\", \"，\"])\n",
    "all_splits = text_splitter.split_documents(documents)\n",
    "all_splits"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 嵌入\n",
    "创建对应的嵌入算法，这里采用千帆大模型平台的 `QianfanEmbeddingsEndpoint` 接口。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import QianfanEmbeddingsEndpoint #sdk\n",
    "\n",
    "embeddings = QianfanEmbeddingsEndpoint()\n",
    "# embeddings-v1"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 向量检索和存储\n",
    "引入创建的BES集群，用Langchain内部集成的`BESVectorStore`接口创建向量存储对象(详见:[BESVectorStore](https://python.langchain.com/docs/integrations/vectorstores/baiducloud_vector_search))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.vectorstores import BESVectorStore\n",
    "\n",
    "bes = BESVectorStore.from_documents(\n",
    "    documents=documents,\n",
    "    embedding=embeddings,\n",
    "    bes_url=\"your bes cluster url\",\n",
    "    index_name=\"your vector index\",\n",
    ")\n",
    "bes.client.indices.refresh(index=\"your vector index\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 生成QA检索器\n",
    "创建对应 ERNIE-Bot 或者 ERNIE-Bot-turbo 模型的 Langchain Chat Model用于进一步生成QA检索器。model 字段支持使用 `ERNIE-Bot-turbo` 或 `ERNIE-Bot`。这里我们使用 `ERNIE-Bot`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_models import QianfanChatEndpoint\n",
    "from langchain.chains import RetrievalQA\n",
    "\n",
    "retriever = bes.as_retriever(search_type=\"similarity_score_threshold\", search_kwargs={'score_threshold': 0.0})\n",
    "qianfan_chat_model = QianfanChatEndpoint(model=\"ERNIE-Bot\")\n",
    "# sdk prompt load from qianfan\n",
    "qa = RetrievalQA.from_chain_type(llm=qianfan_chat_model, chain_type=\"refine\", retriever=retriever, return_source_documents=True)\n",
    "\n",
    "\n",
    "query = input(\"\\n请输入问题: \")\n",
    "res = qa(query)\n",
    "answer, docs = res['result'], res['source_documents']\n",
    "\n",
    "print(\"\\n\\n> 问题:\")\n",
    "print(query)\n",
    "print(\"\\n> 回答:\")\n",
    "print(answer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "> 问题:\n",
    "2021年12月,DeepMind做了什么?\n",
    "\n",
    "> 回答:\n",
    "根据上下文，2021年12月，DeepMind发布了Gopher模型。该模型具有2800亿参数。经过152个任务的评估，Gopher比当时最先进的语言模型提高了大约81%的性能，特别是在知识密集领域，如事实检测和常识上。\n",
    "因此，正确答案是发布Gopher模型。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "58f7cb64c3a06383b7f18d2a11305edccbad427293a2b4afa7abe8bfc810d4bb"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
